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1  Introduction 

A  sparse  signal  is  a  signal  that  has  very  few 
nonzero  elements  or  one  that  becomes  so  under  a 
basis  change  or  through  a  certain  transform.  Ex¬ 
ploiting  sparsity  has  become  a  common  task  in 
data  sciences.  Compressed  sensing  [1,  2],  regu¬ 
larized  regression  (e.g.,  LASSO  [3]),  and  regular¬ 
ized  inverse  problems  (e.g.,  total  variation  image 
reconstruction  [4])  have  made  l\  optimization  a 
central  tool  in  data  processing  problems.  As  the 
name  suggests,  £\  optimization  problems  recover 
sparse  solutions  by  solving  an  optimization  prob¬ 
lem  involving  an  A -norm. 

Today,  the  scope  of  £\  optimization  is  quickly 
expanding.  The  size,  complexity,  and  diversity 
of  instances  have  grown  significantly.  Beyond  ID 
signals  and  2D  images,  high- dimensional  quanti¬ 
ties  such  as  video,  4D  CT,  and  multi-way  ten¬ 
sors  have  become  the  data  or  unknown  vari¬ 
ables  in  models.  New  applications  have  moti¬ 
vated  structured  solutions  to  optimization  prob¬ 
lems  that  significantly  generalize  our  notion  of 
sparsity.  Such  applications  look  for  low-rank  ma¬ 
trices  or  tensors,  sparse  graphs,  tree  structured 
data  representations,  and  sparse  representations 
involving  only  a  few  dictionary  atoms. 

This  article  gives  self-contained  introductions 
to  t\  optimization  for  sparse  vectors  (Section  2), 
L\  optimization  for  finding  functions  with  com¬ 
pact  support  (Section  3),  and  computing  sparse 
solutions  from  measurements  that  are  corrupted 
by  unknown  noisy  (Section  4). 

2  Can  we  trust  t\  optimization? 

Let  A  be  an  m  X  n  matrix  and  let  b  E  Mm  be 
a  vector.  Suppose  we  wish  to  find  the  sparsest 


solution  to  the  linear  equations  Ax  =  b.  Mathe¬ 
matically,  this  problem  is  equivalent  to  minimiz¬ 
ing  the  £q—  “norm”  of  x  (denoted  by  ||x||o),  which 
counts  the  number  of  nonzero  entries  of  x,  sub¬ 
ject  to  Ax  =  b.  However,  this  is  a  combinatorial 
problem  and  is  generally  NP-hard  [5].  A  com¬ 
putationally  tractable  alternative  is  to  perform 
the  ^i— norm  minimization  in  place  of  £q—  “norm” 
minimization.  We  call  this  problem  the  basis 
pursuit  problem: 

minimize  ||x||i  subject  to  Ax  =  b.  (1) 

iei"  w 

Note  that  the  U-norrn  is  a  convex  function.  Al¬ 
together,  problem  (1)  is  a  convex  optimization 
problem.  Before  discussing  the  numerical  solu¬ 
tion  to  (1),  we  should  question  the  quality  of  the 
£o~to~£i  relaxation:  if  x  E  Mn  is  the  unknown 
sparse  vector  and  b  =  Ax,  can  we  trust  (1)  to 
recover  x? 

First,  whenever  the  linear  system  Ax  =  b  has  a 
unique  solution  x,  x  is  the  unique  solution  to  (1). 
In  this  case,  minimizing  the  £ i -norm  is  unneces¬ 
sary.  Therefore,  it  is  more  interesting  to  con¬ 
sider  the  under-determined  case,  when  Ax  =  b 
has  infinitely  many  solutions.  How  can  we  find 
the  needle  x  in  the  haystack  {x  :  Ax  =  6}? 

Without  loss  of  generality,  let  us  assume  that 
matrix  A  has  full  row  rank,  or  otherwise  some 
rows  of  Ax  =  b  can  be  removed  without  affecting 
the  solution  of  (1).  In  this  case,  A  is  an  m-by-n 
matrix,  where  m  <  n.  We  now  introduce  some 
notation  that  we  will  use  to  present  conditions 
that  guarantee  several  properties  of  x. 

Let  5  =  {i  :  Xj  A  0}  denote  the  set  of  nonzero 
elements  of  x  (a.k.a.,  the  support  of  x),  let  A{ 
denote  the  zth  column  of  A,  and  let  A$  denote 
the  submatrix  of  A  formed  by  the  columns  As 
for  i  E  S. 

When  S  is  fixed,  the  necessary  and  sufficient 
condition  for  model  (1)  to  return  x  uniquely  is 

la.  As  has  full  column  rank,  and 

lb.  there  exists  a  “ dual  certificate ”  denoted  by 
y  E  Mm  such  that  obeys 

i)  ( y,Ai )  =  sign(xj),  for  i  E  S,  and 
h)  -1  <  (y,  Aj)  <  1,  for  j  0  S. 
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Condition  la  basically  says  that  if  an  oracle  tells 
us  that  all  nonzero  elements  of  x  fall  within  S, 
then  we  can  solely  rely  on  the  linear  subsystem 
Agxs  =  b  to  recover  xs,  which  is  the  nonzero 
part  of  x.  Minimizing  £i-norm  cannot  help  here 
because  it  is  a  locally  linear  function.  If  con¬ 
dition  la  is  not  satisfied,  then  x  cannot  be  the 
unique  solution.  Indeed,  in  this  case  there  exists 
a  nonzero  vector  t  E  Mn  such  that  At  =  0  and 
supp(i)  C  S.  Consider  xa  =  x  +  at.  Restrict¬ 
ing  a  to  the  interval  (— e,  e)  for  some  sufficiently 
small  e,  we  have  sign(x)  =  sign(x  +  at)  and 
thus  \\xa || i  =  (sign(x),  xQ)  =  (sign(x),  x  +  at)  = 
||x||i+a(sign(x),  t).  Therefore,  there  exists  some 
a  /  0  such  that  ||xQ||i  <  ||x||i.  Together  with 
Axa  =  Ax  =  b,  x  cannot  be  the  unique  solution 
to  (1). 

Condition  la  also  implies  that  \S\,  the  number 
of  nonzero  components  in  x  ,  must  obey  \S\  <  ru¬ 
in  general,  we  clearly  need  to  take  at  least  151 
linear  measurements  in  order  to  recover  a  signal 
with  |S|  nonzero  elements.  Later  we  will  discuss 
how  large  m  needs  to  be. 

Condition  lb  reveals  the  power  of  £i-norm 
minimization:  when  a  dual  certificate  y  exists, 
it  determines  S.  To  see  this,  suppose  x  E  Mn 
satisfies  Ax  =  b  but  Xj  ^  0  for  some  j  f  S. 
Given  a  dual  certificate  y,  we  will  show 

IMIi  >  (y,  Ax)  =  (y,  Ax)  =  ||x||i  (2) 

and  thus  x  cannot  be  a  solution  to  (1).  Proof 
of  (2):  For  i  E  S,  by  condition  lb  (i),  we 
have  \xi\  =  ( y,Af)xi  and  \xi\  >  ( y,Af)xi . 
For  any  j  S  and  Xj  ^  0,  by  condition 
lb  (ii),  we  have  \xj\  >  (y,Aj)xj.  Therefore, 

Plli  =  =  (y,Ax)  and  Plli  > 

E ies(y>  Ai)xi  +  T,j<ts(y’  Ai)xj  =  (viAx)-  QED 

The  two  conditions  have  a  nice  geometric  in¬ 
terpretation.  Consider  the  hyperplane  H  = 

Mn  :  pTx  =  a},  where  p  =  ATy  and  a  =  yTb, 
and  the  ^-“ball”  B  =  {x  E  Mn  :  ||x||i  <  (3}, 
where  /3  =  pslli-  Condition  lb  ensures  that 
PL  PI  B  is  the  face  of  B  where  sign(x)  =  sign(x). 
Condition  la  further  ensures  that  this  face  in¬ 
tersects  {i  6  I"  :  Ax  =  b}  at  exactly  one  point, 
which  is  xs-  Altogether,  they  ensure  that  xs  is 
uniquely  recovered  by  (1). 


We  already  saw  that  Condition  1  is  sufficient 
for  (1)  to  uniquely  recover  x.  In  fact,  it  is  also 
necessary.  In  addition,  it  is  both  necessary  and 
sufficient  for  the  following  relaxed  problems  to 
have  a  unique  solution: 


minimize  All x lb  +  -||  Ax  —  bill,  (3) 
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minimize  ||x|h  subject  to  ||  Ax  —  blW  <  5:  (4) 

zeR™ 


see  [6]  for  proofs.  The  same  condition  also  guar¬ 
antees  that,  when  b  contains  noise  and/or  when 
x  are  not  exactly  but  approximately  sparse,  the 
solutions  to  (3)  and  (4)  remain  close  to  x  in 
certain  norms  (assuming  appropriate  parameters 
A  and  5 ,  respectively);  a  generalized  condition 
gives  similar  properties  for  the  analysis-^i  model 
[7],  where  the  signal  is  (approximately)  sparse 
under  a  linear  transform  T  and  thus  ||\l/x||i,  in¬ 
stead  of  p||i,  is  minimized;  see  [8]  and  the  ref¬ 
erences  therein.  The  total  variation  model  [4]  is 
a  well-known  example. 

The  existence  of  a  dual  certificate  is  the  key 
to  the  success  of  t\  optimization.  Given  |5|  (the 
number  of  nonzero  elements  in  the  signal),  the 
set  of  vectors  y  obeying  Condition  lb  part  (i)  is 
larger  if  m  is  larger.  Now  fixing  m,  the  set  of  vec¬ 
tors  y  obeying  part  (ii)  is  larger  if  n  is  smaller. 
Therefore,  recovery  by  l\  optimization  is,  in  gen¬ 
eral,  more  likely  to  succeed  if  there  are  a  lot  of 
linear  measurements  and  just  a  few  nonzero  com¬ 
ponents,  which  is  intuitive. 

The  main  condition  listed  above  is  based  on 
a  fixed  support  set  S.  It  is  numerically  veri¬ 
fiable  only  when  S  is  either  given  (by  an  ora¬ 
cle)  or  known  to  have  moderately  many  possi¬ 
bilities.  However,  there  are,  in  general,  expo¬ 
nentially  many  possible  support  sets  S,  and  the 
correct  set  is  hard  to  guess  in  advance.  How  can 
we  ensure  the  existence  of  a  dual  certificate  for 
every  possible  sparse  signal  xl  Such  a  setting  is 
commonly  referred  to  as  “uniform  sparse  recov¬ 
ery.”  There  has  been  very  encouraging  answers 
based  on  conditions  such  as  the  mutual  incoher¬ 
ence  condition  [9,  10],  the  null-space  property 
(NSP)  [2,  11],  the  restricted  isometry  principle 
(RIP)  [1],  the  spherical  section  property  [12],  and 
so  on. 
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A  surprising  result  in  uniform  sparse  recov¬ 
ery  is  that  as  long  as  the  entries  of  A  are  sam¬ 
pled  from  subgaussian  distributions,  then  with 
overwhelming  probability,  all  sparse  signals  with 
no  more  than  k  =  0(m/  log(n/m))  nonzero  el¬ 
ements  can  be  uniquely  recovered  by  (1).  Fur¬ 
thermore,  even  if  b  contains  noise  and/or  x  is  ap¬ 
proximately  k  sparse,  the  recovery  remains  sta¬ 
ble.  The  constant  in  the  big-0  is  very  mild.  The 
result  says  that  we  shall  trust  t\  optimization 
for  recovering  any  sparse  signal  from  a  number 
of  qualified  linear  observations  that  are  merely  a 
few  times  more  in  number  than  the  nonzero  ele¬ 
ments  in  the  signal.  We  pay  a  mild  price  for  not 
knowing  the  locations  of  the  nonzero  elements 
because  of  the  help  of  i\  minimization. 

3  L\  optimization  for  com¬ 
pactly  support  solutions 

We  have  seen  the  power  of  t\  optimization  for 
inducing  sparsity  in  finite  dimensional  problems. 
In  this  section,  we  consider  extensions  of  t\  opti¬ 
mization  to  infinite  dimensional  calculus  of  vari¬ 
ations  type  problems  arising,  for  example,  in 
physics  and  elsewhere. 

We  can  think  continuously  and  start  with  a 
very  simple,  but  canonical  example.  Let  u,  f  : 
M1  — >  M,  u  £  H1,  f  e  L2,  and  consider  the  toy 
problem: 

minimize-  [  |itr|2  —  f  fu-\ — 

ugh1  2  J  J  p 

(We  will  abuse  notation  below  and  let  ||  ■  || 2,  ||  *  ||i 
now  be  the  continuous  L2  and  L\  norms  and  (  ,  ) 
be  the  continuous  L 2  inner  product.) 

When  minimizing  this  problem,  we  are  led  to 
the  Euler-Lagrange  equation 

uxx  +  f=-p(u),  (5) 


We  might  also  consider  gradient  descent  on 
this  toy  problem,  with  t  being  the  descent  direc¬ 
tion,  obtaining 


ft  -  Uxx  T  f 


(6) 


as  an  evolution  equation. 

We  can  hope  that  these  L\  regularized  (or  per¬ 
turbed)  problems  will  have  solutions  that  vanish 
a  lot,  e.g.,  have  compact  support.  The  theo¬ 
retical  framework  for  this  was  developed  by  H. 
Brezis  in  some  important  papers  from  the  early 
1970’s  [13,  14],  without  any  connection  to  com¬ 
pressive  sensing  and  without  any  suggested  nu¬ 
merical  implementation.  He  considered  a  wide 
class  of  second-order  elliptic  equations  and,  with 
Friedman  [14],  an  extension  to  parabolic  equa¬ 
tions.  In  [15,  16]  we  showed  that  many  inter¬ 
esting  problems  of  physics  can  be  rewritten  in 
this  L 1  form  and  demonstrated  advantages,  both 
numerically  and  in  physical  understanding  that 
arises  from  this  approach. 

It  is  instructive  to  give  the  following  formal 
argument,  which  helps  explain  why  the  measure 
of  the  support  shrinks  as  p  0.  Consider  a  so¬ 
lution  to  (5)  on  an  interval  x\  <  x  <  X2  with 
u(x  1)  =  u(x 2)  =  0  and  u(x)  >  0  for  x\  <  x  <  X2- 
Hence,  p{u(x))  =  1  for  x\  <  x  <  X2- 

Integrating  (5)  from  x  1  to  X2  gives  us 


=  x2-  Xl, 


but  ux(x 2)  <  0  <  ux(x  1)  so: 

rx  2 

%2  —  x\  <  p,  /  f(x)dx. 

J  X\ 

This  gives  a  bound  on  the  interval  in  terms  of 
/,  which  diminishes  with  p.  Similarly,  if  instead 
u(x  1)  =  u(x 2)  =  0  and  u(x)  <  0  for  x\  <  x  <  X2, 
then  we  have  p(u(x))  =  —  1  for  x\  <  x  <  X2  and 


where  p(u)  is  a  subgradient  of  ||u||i.  We  have 
Hi  =  ( u,p(u ))  and,  for  any  v, 


=  X2  -  Xl. 


\V  1  - 


u||l  >  (v  -  u,p(u)), 
fill  >  (v,p(u)}. 


This  time 


-ux(x 2)  <  0  <  -ux(xi),  so 
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Note  that  computing  (/  +  tB)  1g  =  v  means 
that  we  are  solving  for  v  in 


X2  —  xi  < 


-H  /  f(x)dx 
J  X\ 


and  we  get  the  same  kind  of  estimate.  This  for¬ 
mal  argument  can  be  generalized  to  a  wide  class 
of  elliptic  problems. 

We  can  borrow  computational  techniques  from 
t\  optimization  to  devise  efficient  and  novel  nu¬ 
merical  methods  for  these  and  a  wide  variety  of 
classical  problems.  The  key  tool  from  a  numeri¬ 
cal  point  of  view  is  the  simple  “soft  thresholding” 
or  “shrink”  operator.  Recall: 


g  =  v  +  T[ip(v) 


or  solving 


minimize  ||w 

V 


i  + 


2  Tfl 


\V  -  9 1 


The  solution  is 


v  =  shrink(g,  rp). 


shrink(x, u)  =  argmin  p\y\  +  -\x  —  yY 
y  2 


= 


x-  9, 
0, 

x  +  p, 


x  >  n, 
\x\  <  g,, 
x  <  —p. 


In  [16]  we  applied  this  approach  to  PDEs  that 
come  from  a  variational  problem,  either  by  min¬ 
imization,  obtaining  an  elliptic  PDE,  or  by  gradi¬ 
ent  descent  to  obtain  a  parabolic  PDE.  Addition¬ 
ally,  some  PDEs  can  be  rewritten  using  the  L 1 
subgradient  such  as  the  divisible  sandpile  prob¬ 
lem  and  the  signum-Gordon  equation  [15].  Given 
a  linear  second  order  elliptic  operator  C(u),  we 
would  like  to  solve  numerically 


0  <E  -C(u)  -  f  +  pp(u), 

0  G  ut  -  C(u)  -  f  +  pp(u). 

Let  Au  =  —C(u)  —  /,  Bu  =  pp(u),  and  r  be 
the  time  step  for  the  time  dependent  problem. 
A  very  convenient  implicit  and  unconditionally 
stable  method  is  known  as  the  Douglas-Rachford 
splitting  algorithm  [17].  Let 


which  is  simple  to  implement.  See  [17]  for  a  con¬ 
vergence  proof  of  this  method.  This  is  uncondi¬ 
tionally  stable  and  the  possible  multi-valuedness 
of  p(u)  gives  no  difficulties. 

In  [16]  we  constructed  an  efficient  numerical 
scheme  for  solving  obstacle  problems  in  the  di¬ 
vergence  form.  We  reformulated  the  problem  in 
terms  of  an  L 1  like  penalty  on  the  variational 
problem.  This  is  an  exact  regularizer.  The  tech¬ 
nique  also  applies  to  classical  obstacle  problems 
as  well  as  some  related  free  boundary  problems, 
e.g.,  Hele-Shaw  and  two  phase  membrane.  The 
resulting  methods  are  quite  simple,  again  involv¬ 
ing  the  shrink  operator,  and  seem  to  outperform 
classical  approaches. 

Perhaps  the  most  significant  application  in 
this  set  of  ideas  involves  obtaining  compactly 
supported  approximations  to  eigenfunctions  of 
the  Schrodinger  equation  [18,  19].  These  have 
long  been  sought  [20]  and  are  called  Wannier 
functions.  These  were  developed  in  solid  state 
physics  and  quantum  chemistry.  In  [18,  19,  21] 
we  developed  and  analyzed  a  natural  and  easy 
to  implement  method  to  do  this. 

Consider  the  Hamiltonian 


u  ~  u(kAt) 

then  update: 

uk+1  =  (1 +tB)~1  ((1  +  tA)~1(  1  —  tB)  +  tB )  u‘ 
which  can  be  written  as 
uk+1  =  (. I  +  TB)~luk , 

uk+1  =  uk  +  {  1  +  TA)-\2uk+1  -  uk)  -  uk+1. 


H  =  -\a  +  V(x), 

where  A  is  the  Laplacian  and  V  is  a  potential 
with  eigenvalues  Ai  <  A2  ■  ■  ■ . 

We  obtain  compactly  supported  approxima¬ 
tions  to  eigenfunctions  by  solving  the  variational 
problem 

N 

E0=  min  Y Htpj) 

J=1 
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subject  to  {ipi,  ifj)  =  djk,  for  i,j  =  1, . . . ,  N. 

We  get  densely  supported  tpj,  (think  of  sines 
and  cosines  when  V  =  0).  Physicists  and 
chemists  want  short-ranged  interaction.  The 
original  Wannier  functions  (1937)  involve  a  sub¬ 
space  rotation  of  the  ipj  following  by  a  cut-off 
to  get  compactly  supported  approximate  eigen¬ 
functions. 

We  just  add  an  L 1  regularization  in  the  previ¬ 
ous  variational  problem,  obtaining 

N  1 

E=  min  “Ill'll1  +  (7) 

VI ,W2,—,WN  ,  P 

J=1 

subject  to  (ipi,  ipj)  =  djk,  for  i,  j  =  1, . . . ,  N. 

It  turns  out  that  this  can  be  solved  rapidly 
using  the  split  Bregman  algorithm  with  an  extra 
(nonconvex)  projection  step  [22].  The  L 1  term 
actually  often  speeds  up  the  optimization! 

We  have  a  fairly  complete  approximation  the¬ 
ory  [23].  We  can  also  impose  shift  invariance, 
i.e  orthogonality  to  the  translations  of  the  eigen¬ 
functions  by  lattice  vectors  [24],  The  only  non¬ 
linear  steps  in  the  algorithm  are  very  simple 
scalar  operations.  The  resulting  approximate 
eigenfunctions  resemble  Meyer  wavelets  [25] ,  but 
have  compact  support  and  are  intimately  con¬ 
nected  to  the  Schrodinger  equation. 

4  Computing  Paths  of  Sparse 
Solutions 

When  the  vector  b  is  corrupted  due  to  noise  at 
an  unknown  level,  it  is  not  straightforward  to 
calculate  the  correct  value  of  A  in  (3).  Certain 
methods,  such  as  cross  validation,  exist  to  solve 
this  problem,  but  they  need  the  solutions  to  (3) 
corresponding  to  all  (or  largely  many)  parameter 
values  A  >  0.  While  solving  a  single  i\  problem  is 
inexpensive,  solving  (3)  for  the  entire  path  of  so¬ 
lutions  x\  for  all  A  >  0  can  be  time-consuming. 

In  addition,  an  unpleasant  by-product  of  min¬ 
imizing  1 1  -norm  in  model  (3)  is  the  loss  of  signal 
magnitude.  Consider  a  toy  problem  b  =  ax  +  e, 
where  e  is  noise  and  a ,  b,  x  are  strictly  positive. 


The  solution  to  (3)  is 

0,  A  >  ab, 

a  A  G  (0,  0.6]. 

Unless  A  =  0,  we  always  get  x\  <  b/a.  Roughly 
speaking,  model  (3)  returns  a  sparse  x\  by  re¬ 
ducing  the  magnitudes  of  its  components;  oth¬ 
erwise,  the  solution  will  have  many  nonzero  ele¬ 
ments  since  the  noise  e  cannot  be  sparsely  rep¬ 
resented.  However,  the  magnitudes  of  the  true 
nonzero  components  are  also  reduced,  causing 
solution  bias. 

This  section  describes  a  simple  solution  to  re¬ 
solve  these  issues.  We  restrict  our  discussion  to 
the  Euclidean  space  Mn.  The  optimality  condi¬ 
tion  of  (3)  is: 

0  =  \p  +  At(Axx  -  b),  p£d\\xx\\i-  (8) 

Introducing  A  =  l/t  and  then  replacing  A p  =  j 
in  (8)  by  so  that  we  can  evolve  p  over  time  t, 
we  arrive  at  the  new  system,  known  as  inverse- 
scale  space  (ISS)  [26,  27]: 

p(t)  = -AT(Ax(t) -b),  p  G  d\\x(t) ||i.  (9) 

This  is  an  ordinary  differential  inclusion,  for 
which  we  set  initial  solution  p( 0)  =  x(0)  =  0. 
For  well-definedness,  we  let  x  to  be  right  contin¬ 
uous,  let  p  be  right  continuously  differentiable, 
and  let  p  denote  the  right  time  derivative  of  p. 

It  is  easy  to  evolve  the  system  (9)  because  at 
each  time  t  >  0,  either  Pi(t)  is  changing  value  or 
Xi(t)  is  so,  but  not  both.  This  is  because  Pi(t)  is  a 
subgradient  of  |  Xi  (t)  | ,  so  Xi  ( t )  must  stay  0  when¬ 
ever  pi(t )  is  changing  value  between  (—1,1),  and 
once  Xi(t)  becomes  strictly  positive  or  strictly 
negative,  Pi(t)  must  stay  1  or  —1,  respectively. 
We  can  construct  a  solution  path  to  (9)  by  keep¬ 
ing  x  fixed  and  evolving  p,  at  all  but  a  set  of 
time  points  where  some  Pi(t)  reaches  either  1  or 
—  1.  At  those  times,  x  is  updated  as  follows.  Let 
Si  =  {?  :  Pi(t)  =  1},  S2  =  {i  :  Pi(t)  =  -1},  and 
T  =  (Si  U  S2)c.  Following  (9),  x(t)  is  a  solution 
to  the  system: 

xsi  >  0,  xs2  <  0,  xt  =  0,  (10a) 

0  =  As1uA2(Ax-b).  (10b) 
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Equation  (10b)  prevents  (9)  from  evolving  Pi{t) 
above  1  or  below  —1.  The  system  (10)  is  equiv¬ 
alent  to  the  problem 

minimize  -||Ax  —  b\\2  subject  to  (10a).  (11) 

The  entire  path  {p(t),  x(t)}t>o  can  be  obtained 
by  alternating  between  evolving  p(t)  by  (9)  and, 
upon  some  Pi(t)  hitting  either  1  or  —1,  updat¬ 
ing  x[t)  by  solving  (10)  or  (11).  Note  that  Ax(t) 
is  always  unique  even  if  x(t)  is  not,  so  the  path 
{p{t),  Ax(t)}t>o  is  unique.  When  Ax  =  b  is  con¬ 
sistent,  there  exists  T  >  0  such  that  x(t)  =  x(T), 
for  t  >T,  and  x(T)  is  a  solution  to  (1). 

Applying  model  (9)  to  the  toy  example  gives 

0,  \  >  ab, 

a>  i  e  (°, a6]. 

Notice  that  the  bias  —  ^  is  gone! 

Of  course,  one  can  manually  add  —  ^  back 
to  the  solution  of  (3)  and,  in  the  general  case, 
manually  update  x\  by  solving  an  additional  de¬ 
biasing  problem,  for  example,  minimize  ^  \ \  Ax  — 
6|j2  subject  to  x  on  the  same  support  as  x\.  How¬ 
ever,  this  will  not  recover  the  solution  path  x(t) 
of  ISS.  In  general,  x(t)  and  x\,  for  A  =  1/t,  do 
not  have  the  same  support.  This  is  because  bias 
not  only  reduces  magnitude  but  also  affects  the 
support  of  x\.  Debiasing  only  changes  the  val¬ 
ues  of  x\,  not  its  support.  Therefore,  introduc¬ 
ing  bias  and  removing  it  are  not  as  effective  as 
avoiding  bias  at  the  beginning. 

Without  minimizing  the  £ i  -norm  in  (9),  is  x(t) 
still  sparse?  The  answer  is  interesting:  x(t) 
and  x\  are  both  sparse  for  the  same  reason: 
p  6  Range(AT),  which  holds  for  both  (8)  and 
(9).  In  our  work,  we  deal  with  the  underdeter¬ 
mined  case  where  A  has  more  columns  than  rows, 
so  p  stays  in  a  small  m-dimensional  subspace  in 
a  large  n-dimensional  space.  On  the  other  hand, 
from  the  subgradient  relation  between  p  and  x, 
x  is  sparse  if  few  components  of  p  equal  1  or  —  1. 
The  l\  subgradient  p  takes  value  in  the  hyperbox 
[—1,  l]n.  The  faces  of  the  hyperbox  are  precisely 
the  vectors  which  have  1  or  —1  in  some  compo¬ 
nent  Having  more  components  equal  to  1  or  —1 
means  that  p  is  on  a  smaller  dimensional  face. 


For  example,  if  all  components  are  equal  to  1,  the 
hyperbox  face  is  just  a  single  point.  Therefore, 
Therefore,  when  the  dimension  of  range(AT)  is 
small,  it  is  unlikely  for  p  to  have  many  1  or  —  1 
components,  so  x  is  likely  sparse.  More  formal 
analysis  can  be  found  in  [28].  The  point  is  that 
x  is  sparse  because  p,  the  l\  subgradient  at  x, 
is  in  the  range  of  AT ,  not  because  of  the  usual 
properties  of  the  ^i-norm. 

One  of  the  main  advantages  of  (9)  is  how 
quickly  and  easily  it  computes  the  solution  path. 
As  argued  above,  it  can  be  computed  piece- 
wise,  and  every  piece  is  a  sign-constrained  least- 
squares  problem  (11)  that  is  similar  to  the  pre¬ 
vious  one,  so  one  can  warm-start  and  solve  it 
very  quickly  using  QR  updates.  There  are  also 
other  methods  to  obtain  an  approximate  solu¬ 
tion  path  even  faster:  for  example,  (discrete¬ 
time)  Bregman  iteration  [29,  30],  (continuous¬ 
time)  linearized  Bregman  ISS  [27],  and  (discrete¬ 
time)  linearized  Bregman  iteration  [30,  31]. 

Bregman  iteration  is  the  forward  Euler  itera¬ 
tion  of  (9): 

jJe+1=l/e-—AT(Axk-b),  pfc€d||z*||i, 
m 

which  is  the  optimality  condition  to 

xk+l  =  argmin D(x\xk)  +  -^-|| Ax  —  6||2,  (12) 
zeR"  2m 

where  D(x\  xk)  :=  ||x||i  —  \\xk'\\i  —  (pk,  x  —  xk)  is 
the  Bregman  distance  induced  by  td— norm.  In¬ 
terestingly,  after  a  change  of  variable,  (12)  re¬ 
duces  to  the  equivalent  “add-back-the-residual” 
iteration: 

xk+1  =  argmin  ||x||i  +  ^-|| Ax  —  bk II2,  (13a) 

zsR71  2m 

bk+1  =  bk  +  (b-Axk).  (13b) 

Each  iteration  of  (13a)  requires  minimizing  a 
problem  similar  to  (3)  except  that  residual  ( b  — 
Axk)  is  added  back  to  the  measurement.  This  is 
better  than  solving  (3)  and  tuning  its  A  because 
this  restores  lost  magnitude  in  the  signal.  See 
Figure  1. 

In  addition,  one  can  apply  an  existing  code 
for  (3)  to  solver  the  subproblem  (13a).  Further¬ 
more,  (13)  has  another  interesting  property  of 
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Figure  1:  Recover  x  (red)  from  Gaussian  noisy  measurements  b  =  Ax  +  e.  Left:  the  solution 
(blue)  of  model  (3)  (also  known  as  BPDN  or  basis  pursuit  denoising)  with  hand  picked  A  =  49  for 
best  sparsity-noise  tradeoff.  Right:  the  5th  iterate  (blue)  of  the  Bregman  iteration  (12)  or  (13). 
Conclusion:  while  true  x  (blue)  cannot  be  recovered  due  to  unknown  noise,  the  Bregman  solution 
(right  blue)  recovers  the  lost  magnitude  in  the  l\  solution  (left  blue)  and  three  additional  large 
elements,  near  the  right  end. 


error  forgetting  [32]:  the  subproblem  (13a)  can 
be  solved  inexactly  with  error,  but  the  errors  do 
not  accumulate;  instead,  they  cancel  each  other 
so  that  xk  still  converges  quickly. 

We  can  get  even  faster  linearized  Bregman  al¬ 
gorithms  by  smoothing.  Simply  add  to  (9) 
and  obtain 

p{t)  +  -x  =  -AT(Ax(t)  —  b),  p  €  <9||x(t)||i. 

Av 

(14) 

It  has  a  piece-wise  smooth  solution,  which  con¬ 
verges  to  the  unsmoothed  solution  exponentially 
fast  as  k  increases.  By  introducing  z  =  p  + 

(14)  reduces  to  an  ordinary  differential  equation: 

z(t)  =  — Rr(fi:^4shrink(2:(t),  1)  —  b ).  (15) 

There  is  no  inclusion  anymore.  This  is  because 
the  mapping  between  z  and  {p,  x)  is  one-one. 
Given  z,  we  uniquely  recover  x  =  /-i  shrink (z,  1) 
and  p  =  z  —  fx.  The  forward  Euler  iteration  of 

(15)  is  known  as  the  linearized  Bregman  itera¬ 
tion,  which  evolves  quickly  [33]  and  can  be  easily 
parallelized  for  problems  with  massive  amounts 
of  data  [34], 

All  we  have  discussed  in  this  section  general¬ 
izes  naturally  to  other  regularization  function  in 


place  of  the  i\  norm.  If  one  is  using  the  mini¬ 
mization  model: 

minimize  r  (x)  +  tf(x) 

where  r  enforces  a  solution  structure  and  /  is  a 
differentiable  data  fidelity  function,  we  encour¬ 
age  trying  the  ISS  system 

Pit)  =  -  fix),  p(t)  G  dr  (s(t)) , 

which  will  likely  reduce  bias  and  compute  a  solu¬ 
tion  path  quickly  while  still  keeping  the  desired 
structure  for  the  solution. 
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